GPQA Diamond

A graduate-level, Google-proof science benchmark where PhD experts reach only 65% — and frontier AI models now surpass them

Published: August 20, 2025

Keywords: GPQA Diamond, AI benchmark, graduate-level science QA, Google-proof questions, PhD-level evaluation, frontier LLM benchmark, physics chemistry biology, expert-level reasoning, COLM 2024, NYU benchmark

Introduction

Most AI benchmarks, even challenging ones like MMLU, have been saturated by frontier models, with leading systems scoring over 90%. This leaves them of little use for distinguishing between state-of-the-art models or for measuring genuine scientific reasoning.

GPQA Diamond is different. It is the hardest, most vetted subset of GPQA, the Graduate-Level Google-Proof Q&A benchmark: 198 multiple-choice questions in biology, physics, and chemistry so difficult that PhD-level domain experts reach only 65% accuracy. Non-expert validators, given over 30 minutes per question and full internet access, score only 34% — making these questions truly “Google-proof.”

“We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: even highly skilled non-expert validators only reach 34% accuracy, despite spending over 30 minutes with unrestricted access to the web.” — GPQA Paper

graph LR
    A["Traditional Benchmarks<br/>(MMLU, etc.)<br/>90%+ accuracy"] --> B["Benchmark<br/>Saturation"]
    B --> C["GPQA Diamond<br/>198 PhD-level questions<br/>Experts: 65%"]
    C --> D["Meaningful signal<br/>for frontier AI<br/>reasoning"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is GPQA Diamond?

GPQA (Graduate-Level Google-Proof Q&A) is a benchmark of multiple-choice questions written by domain experts in three science domains. The paper’s main set contains 448 questions, drawn from 546 collected in total, and the benchmark is released in three subsets of increasing quality and difficulty:

| Subset | Questions | Description |
| --- | --- | --- |
| GPQA Extended | 546 | All collected questions, including lower-quality ones |
| GPQA Main | 448 | Filtered for quality and difficulty |
| GPQA Diamond | 198 | Hardest subset: questions that both expert validators answered correctly and that most non-expert validators answered incorrectly |

Why “Diamond”?

The Diamond subset applies the strictest quality filter: a question is included only if both independent expert validators (domain experts other than the question writer) answered it correctly, while the majority of skilled non-expert validators answered it incorrectly (a minimal sketch of this filter follows the list below). This double validation ensures every question is:

  1. Unambiguously correct — verified by two independent experts
  2. Genuinely difficult — not solvable through surface-level reasoning or web search
  3. High signal — provides maximum information about model capabilities
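
As a rough illustration only — this is not code from the GPQA release, and the record fields are hypothetical — the Diamond filter can be written as a simple predicate over per-question validation results:

def is_diamond(validation):
    """Hypothetical Diamond filter over one question's validation results.

    `validation` is an assumed record with two lists of booleans:
      - "expert_correct": whether each of the two expert validators answered correctly
      - "non_expert_correct": whether each non-expert validator answered correctly
    """
    experts = validation["expert_correct"]
    non_experts = validation["non_expert_correct"]
    both_experts_correct = all(experts)
    majority_non_experts_wrong = sum(non_experts) < len(non_experts) / 2
    return both_experts_correct and majority_non_experts_wrong

# Example: both experts correct, two of three non-experts wrong -> kept in Diamond
print(is_diamond({"expert_correct": [True, True], "non_expert_correct": [False, True, False]}))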

Key Characteristics

| Feature | Details |
| --- | --- |
| Total questions | 198 (Diamond subset) |
| Domains | Biology, Physics, Chemistry |
| Question type | Multiple-choice (4 options) |
| Expert accuracy | 65% (PhD-level domain experts) |
| Non-expert accuracy | 34% (with 30+ minutes and full web access) |
| Original GPT-4 baseline | 39% (November 2023) |
| License | CC-BY-4.0 |

What Makes It “Google-Proof”?

graph TD
    Q["PhD-level science<br/>question posed"] --> E["Domain Expert<br/>(PhD holder)<br/>65% accuracy"]
    Q --> N["Non-Expert Validator<br/>(30+ min, full web)<br/>34% accuracy"]
    Q --> M["GPT-4<br/>(Nov 2023 baseline)<br/>39% accuracy"]

    E --> V{"Both experts<br/>agree on answer?"}
    V -->|Yes| D["Included in<br/>GPQA Diamond"]
    V -->|No| X["Excluded from<br/>Diamond subset"]

    style Q fill:#8e44ad,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style N fill:#e74c3c,color:#fff,stroke:#333
    style M fill:#f39c12,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style X fill:#95a5a6,color:#fff,stroke:#333

The term “Google-proof” means that non-expert validators — intelligent individuals without domain-specific PhD training — cannot solve these questions even with unlimited internet access. The questions require deep conceptual understanding, multi-step reasoning, and expert-level domain knowledge that cannot be pieced together from search results alone.

Who Built It?

GPQA was developed at New York University (NYU) by:

  • David Rein — Lead author
  • Betty Li Hou, Asa Cooper Stickland, Jackson Petty — Core researchers
  • Richard Yuanzhe Pang, Julien Dirani, Julian Michael — Contributing researchers
  • Samuel R. Bowman — Senior advisor (NYU)

Publication

GPQA was published at the First Conference on Language Modeling (COLM 2024), one of the premier venues for language model research.

| Resource | Link |
| --- | --- |
| arXiv paper | arxiv.org/abs/2311.12022 |
| GitHub repository | github.com/idavidrein/gpqa |
| Hugging Face dataset | huggingface.co/datasets/Idavidrein/gpqa |
| Conference | COLM 2024 (First Conference on Language Modeling) |

What Skills Does It Test?

GPQA Diamond tests deep expert-level scientific reasoning — not surface-level knowledge retrieval.

graph TD
    GPQA["GPQA Diamond<br/>198 questions"] --> P["Physics<br/>Quantum mechanics,<br/>thermodynamics,<br/>relativity"]
    GPQA --> C["Chemistry<br/>Organic reactions,<br/>spectroscopy,<br/>molecular structure"]
    GPQA --> B["Biology<br/>Molecular biology,<br/>genetics,<br/>biochemistry"]

    style GPQA fill:#e74c3c,color:#fff,stroke:#333
    style P fill:#3498db,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style B fill:#f39c12,color:#fff,stroke:#333

| Capability | What GPQA Diamond Tests |
| --- | --- |
| Graduate-level knowledge | Questions require PhD-level understanding in specific subfields |
| Multi-step reasoning | Most questions demand chaining multiple concepts together |
| Resistance to search | Answers cannot be found via web search — they require deep understanding |
| Cross-domain synthesis | Some questions span subdisciplines within a field |
| Calibration | Whether models can accurately assess their own confidence |

Example Difficulty

A typical GPQA Diamond question might ask about the outcome of a specific quantum mechanical calculation, the product of a multi-step organic synthesis, or the implications of a particular genetic regulatory mechanism — requiring graduate-level coursework and research experience to answer correctly.

Current Leaderboard

The table below compiles GPQA Diamond accuracy scores from official model announcements and technical reports. All scores are pass@1 (single attempt) unless otherwise noted.

Sources: OpenAI model announcements (o1 blog, o3-mini blog), Google DeepMind (Gemini 2.5 blog), Anthropic model cards, original GPQA paper. Consulted July 2025.

| Rank | Model | Accuracy (%) | Source |
| --- | --- | --- | --- |
| – | Human domain experts (PhDs) | 65.0 | GPQA paper |
| – | Non-expert validators (30+ min, web) | 34.0 | GPQA paper |
| 1 | o1 (OpenAI) | 77.3 | OpenAI o1 blog |
| 2 | o3-mini (high) (OpenAI) | 77.0 | OpenAI o3-mini blog |
| 3 | o1-preview (OpenAI) | 73.3 | OpenAI o1 blog |
| 4 | GPT-4o (OpenAI) | 50.6 | OpenAI o1 blog |
| 5 | GPT-4 (OpenAI, 2023 baseline) | 39.0 | GPQA paper |

Key takeaways:

  • o1 was the first AI model to surpass human PhD experts on GPQA Diamond (77.3% vs. 65%), a milestone highlighted by OpenAI
  • o3-mini (high) matches o1 performance at significantly lower cost
  • The gap between non-experts with web access (34%) and experts (65%) confirms questions are genuinely “Google-proof”
  • Even GPT-4o (50.6%) falls short of PhD expert performance, despite being far more capable than the original GPT-4 baseline

Note: More recent models — including o3, Gemini 2.5 Pro, Claude 3.7 Sonnet (extended thinking), and DeepSeek-R1 — have also been evaluated on GPQA Diamond. Google reports Gemini 2.5 Pro as “state-of-the-art” on GPQA. For the latest results, consult the resources listed below.

Where to Explore the Benchmark

Dataset and Code

| Resource | Description | Link |
| --- | --- | --- |
| Hugging Face dataset | Full GPQA dataset (Main, Extended, Diamond splits) | huggingface.co/datasets/Idavidrein/gpqa |
| GitHub repository | Evaluation code, baselines, and documentation | github.com/idavidrein/gpqa |
| arXiv paper | Full technical paper with methodology and analysis | arxiv.org/abs/2311.12022 |

Load the Dataset

from datasets import load_dataset

# GPQA is gated on Hugging Face to limit training-data contamination, so you may
# need to accept the dataset terms and authenticate first (e.g. `huggingface-cli login`).
dataset = load_dataset("Idavidrein/gpqa", "gpqa_diamond")
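
Each record pairs a question with one correct and three incorrect answers. The sketch below shows one way to turn a record from the dataset loaded above into a shuffled four-option prompt; the field names ("Question", "Correct Answer", "Incorrect Answer 1–3") and the "train" split follow the Hugging Face dataset card, so double-check them against the version you load.

import random

def format_question(record, rng):
    """Build a shuffled four-option multiple-choice prompt from one GPQA record."""
    options = [
        record["Correct Answer"],
        record["Incorrect Answer 1"],
        record["Incorrect Answer 2"],
        record["Incorrect Answer 3"],
    ]
    rng.shuffle(options)  # randomize option order so position carries no signal
    gold_letter = "ABCD"[options.index(record["Correct Answer"])]
    body = "\n".join(f"({letter}) {text}" for letter, text in zip("ABCD", options))
    return f"{record['Question']}\n\n{body}", gold_letter

rng = random.Random(42)
prompt, gold_letter = format_question(dataset["train"][0], rng)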

Understanding the Metrics

Pass@1 Accuracy

The primary metric. Each question is a 4-option multiple-choice problem. The model produces a single answer, and accuracy is the fraction of correct responses. Random baseline is 25%.
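
As a minimal sketch (the `answer_fn` callable is a placeholder for whatever produces the model’s letter choice, not part of any official harness), pass@1 is just the fraction of questions answered correctly on a single attempt:

def pass_at_1(examples, answer_fn):
    """Accuracy over one attempt per question.

    `examples` is an assumed list of dicts with "prompt" and "gold" (letter) keys;
    `answer_fn(prompt)` returns the model's single letter choice, e.g. "B".
    """
    correct = sum(answer_fn(ex["prompt"]) == ex["gold"] for ex in examples)
    return correct / len(examples)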

Consensus@64

Some evaluations (notably OpenAI’s) also report consensus@64: the model generates 64 responses per question, and the final answer is chosen by majority vote. This measures how much accuracy improves when the model’s answers are aggregated across many samples (a minimal sketch of the aggregation follows the table below).

| Model | Pass@1 | Consensus@64 |
| --- | --- | --- |
| GPT-4o | 50.6% | 56.1% |
| o1-preview | 73.3% | 78.3% |
| o1 | 77.3% | 78.0% |

Key insight: The small gap between pass@1 and consensus@64 for o1 (77.3% vs. 78.0%) suggests the model’s answers are highly consistent — it either knows or doesn’t know, with little variance.
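
A minimal sketch of the majority-vote aggregation, assuming a `sample_answer(prompt)` callable that returns one letter per call (a real evaluation would batch these calls and control sampling temperature):

from collections import Counter

def consensus_answer(prompt, sample_answer, k=64):
    """Sample k answers for one prompt and return the most common letter."""
    votes = Counter(sample_answer(prompt) for _ in range(k))
    return votes.most_common(1)[0][0]

def consensus_at_k(examples, sample_answer, k=64):
    """Accuracy when each question is decided by majority vote over k samples."""
    correct = sum(
        consensus_answer(ex["prompt"], sample_answer, k) == ex["gold"] for ex in examples
    )
    return correct / len(examples)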

Why GPQA Diamond Matters

graph LR
    A["Expert-level<br/>difficulty"] --> C["GPQA Diamond<br/>as a yardstick"]
    B["Google-proof<br/>questions"] --> C
    C --> D["Measures genuine<br/>scientific reasoning"]
    C --> E["Human-AI<br/>comparison point"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333

  1. First benchmark where AI surpassed PhD experts — o1’s 77.3% vs. experts’ 65% was a landmark moment for AI capabilities
  2. Measures deep reasoning, not retrieval — “Google-proof” design ensures models must actually understand the science
  3. Standard evaluation for frontier models — reported by every major AI lab in model release announcements
  4. Clear human baseline — the 65% expert ceiling provides a meaningful reference point
  5. Focused on STEM — targets the science domains most relevant to AI safety and capability concerns


Conclusion

GPQA Diamond stands as one of the most important benchmarks in AI evaluation:

  • 198 rigorously vetted questions in biology, physics, and chemistry — double-validated by independent PhD experts
  • PhD-level domain experts score only 65% — and non-experts with full web access score just 34%
  • The first benchmark where AI surpassed human experts — OpenAI’s o1 reached 77.3%, crossing the 65% expert threshold
  • Built at NYU and published at COLM 2024, establishing it as a peer-reviewed standard
  • Remains a key differentiator for frontier models — reported in every major model release

As reasoning-focused models continue to improve, GPQA Diamond provides a critical measure of whether AI systems possess genuine scientific understanding — not just the ability to pattern-match answers from training data.
